LABORATORY FOR
COMPUTER SCIENCE
MASSACHUSETTS
INSTITUTE OF
TECHNOLOGY
MIT/LCS/TR-535
ON THE SAMPLE COMPLEXITY
OF PAC-LEARNING USING
RANDOM AND CHOSEN
EXAMPLES
Bronwyn Bonnie Eisenberg
March 1992
545 TECHNOLOGY SQUARE, CAMBRIDGE, MASSACHUSETTS 02139
On the Sample Complexity of Pac-learning using
Random and Chosen Examples
by
Bronwyn Bonnie Eisenberg
B.A., English
Princeton University
(1981)
Submitted to the
Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Master of Science in Electrical Engineering and Computer Science
at the
Massachusetts Institute of Technology
May 1991
© Massachusetts Institute of Technology 1991
Signature of Author
Department of Electrical Engineering and Computer Science
March 1, 1992
Certified by
Ronald L. Rivest
Professor of Computer Science and Engineering
Thesis Supervisor
Accepted by
Arthur C. Smith
Chairman, Department Committee on Graduate Students
Abstract
Two protocols used for learning under the pac-learning model introduced by Valiant
are learning from random examples and learning from membership queries. Membership
queries have also been used to efficiently and exactly learn a concept class C that is too
difficult to pac-learn using random examples. We ask whether using membership queries —
in conjunction with or instead of random examples — can serve a new purpose: helping
to reduce the total number of examples needed to pac-learn a concept class C already
known to be pac-learnable using just random examples. We focus on concept classes that
are dense in themselves, such as half-spaces of R^n, rectangles in the plane, and the class
I = {[0, a] : 0 ≤ a ≤ 1} of initial segments of [0, 1].
The main results of this thesis are:
1. Adding the option of using membership queries cannot significantly reduce the total
number of examples needed to pac-learn a dense-in-itself concept class C;
Ω((1/ε) ln(1/δ)) random examples are still required to pac-learn C, where ε and δ are
the usual pac-learning parameters. Interestingly, this bound holds for any unknown
probability distribution, unlike the "standard" proof (due to Ehrenfeucht, Haussler,
Kearns, and Valiant [12]), which holds only for a particular distribution constructed
by an adversary.
2. Even if the unknown probability distribution is known to be "smooth", at least
(1/4ε) ln(1/δ) examples are required to pac-learn C from random examples (only).
3. For the special case of learning half-spaces of an n-dimensional unit simplex, if the
probability distribution is smooth and examples can be chosen as well as drawn ran-
domly, then the number of examples required to pac-learn decreases significantly, to
n^2 (lg(s/ε) + 4), and the computational complexity is O(n^7 s/ε).
Portions of this thesis are joint work with Ron Rivest.
Thesis Supervisor: Ronald L. Rivest
Title: Professor of Electrical Engineering and Computer Science
Keywords: Pac-learning, Queries, Drawing examples, Choosing examples
Contents
1 Introduction 4
1.1 Overview 4
1.2 Standard Machine Learning Definitions 5
1.3 Pac-learning Definition 8
1.3.1 Pac-learning error rate 8
1.3.2 Pac-learning complexity 8
1.3.3 Protocols for obtaining examples 10
1.4 New Definitions 10
1.4.1 Dense-in-itself concept classes 10
1.4.2 Smooth probability distributions 13
1.5 Pac-learning Background 13
1.6 The Results of this Thesis 15
2 A Lower Bound When Both Drawing and Choosing Examples 18
2.1 A Lower Bound 18
2.2 Comparison of the Lower Bound to Previous Results 21
3 Smoothness Guarantees 23
3.1 A Lower Bound When Drawing Examples Only 23
3.2 Upper and Lower Bounds When Both Drawing and Choosing Examples . . 25
3.2.1 Using the binary-search approach in one dimension 25
3.2.2 Generalizing the binary-search approach to n dimensions 27
Chapter 1
Introduction
1.1 Overview
In 1984, Valiant published A Theory of the Learnable [18]. His paper introduced a model
of learning that has come to be known as probably approximately correct learning, or pac-
learning for short. As proposed by Valiant, a good learning algorithm should, after viewing a
certain number of labeled examples, output a hypothesis that with high probability classifies
almost all unseen examples correctly. Valiant's model has been widely embraced, and many
interesting classes have been shown to be efficiently learnable under the pac-model. (The
notion of efficiency is addressed in more detail later).
Although the pac-model leads to efficient learning algorithms, the most traditional
implementation of the model imposes restrictions on the way learning
can be done, and these restrictions potentially drive up the number of labeled examples
(in other words, the amount of information) the learning algorithm needs to obtain in order
to compute a hypothesis. This thesis asks the following question: can altering several
of the pac-learning restrictions reduce the number of examples the learning algorithm, or
equivalently the learner, needs to obtain in order to compute a hypothesis?
In the standard pac-model, the learner has no choice over what examples are obtained.
Additionally, the learner knows nothing about the probability distribution under which the
learner is learning: in particular, any probability distribution is possible. In Chapter 2 we
consider what additional efficiency can be gained in Valiant's model if the learner has some
control over what examples it receives. We prove a negative result: in spite of having some
choice over what examples are obtained, the number of examples the learner must obtain
in order to compute a hypothesis is within a constant factor of the number of examples the
learner must obtain in the standard pac-model.
In Chapter 3, we ask whether there is an increase in efficiency if the learner knows ahead
of time that certain probability distributions (for instance, extremely skewed distributions)
are disallowed. Here we show a positive result: if the learner knows that certain probability
distributions are not possible, then giving the learner some control over what examples it
may obtain can lead to a substantial decrease in the number of examples needed in order
to compute a hypothesis.
In Chapter 4, we state conclusions and open problems.
We continue this chapter by presenting some standard machine learning definitions, an
introduction to pac-learning, and some new definitions. In light of our definitions and the
introduction to pac-learning, we then discuss the learning models and main results of this
thesis.
1.2 Standard Machine Learning Definitions
We consider the problem of learning an unknown concept, called the target concept, from
examples. The examples are drawn from a space X , called the sample space. For instance,
we may have X = {0, 1}^n or X = R^n. In this thesis we focus on continuous sample spaces,
such as R n . We use the terms learner and learning algorithm interchangeably.
A concept c is a subset of the sample space X. For x ∈ X, we say that x ∈ c is a
positive instance (or positive example) of a concept c, and that x ∉ c is a negative instance
(or negative example) of a concept c. We also use functional notation, writing c(x) = 1
for positive examples of a concept c and c(x) = 0 for negative examples; we call c(x) the
label or classification of x (by c). The unknown target concept is denoted c*. A hypothesis
concept is denoted c.
For instance, if X = R^n, an example of a concept is a half-space defined by an
(n-1)-dimensional hyperplane. Such a hyperplane splits R^n into two half-spaces. The subset of
R^n consisting of the points on the hyperplane and all points on one side of the hyperplane
(that is, one of the half-spaces) is considered to be the concept. Each of these points is a
positive example of the concept, and each point on the other side of the hyperplane is a
negative example of the concept.
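The half-space concept just described can be sketched in a few lines of code. This is an illustrative fragment of our own, not part of the thesis; the function name is invented for the example.

```python
def halfspace_label(w, b, x):
    """Label x under the half-space concept {x : w.x >= b}: points on
    the defining hyperplane or on its positive side are positive (1),
    all other points are negative (0)."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if dot >= b else 0

# The hyperplane x1 + x2 = 1 in R^2 splits the plane into two half-spaces.
print(halfspace_label([1, 1], 1, [0.8, 0.9]))  # 1: positive side
print(halfspace_label([1, 1], 1, [0.1, 0.2]))  # 0: negative side
print(halfspace_label([1, 1], 1, [0.5, 0.5]))  # 1: on the hyperplane itself
```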
A concept thus is a set of points. Observe that the set, which is potentially infinite,
can, for concept classes of interest here, be represented by a finite set of parameters. For
instance, a half-space is a set with an infinite number of points; it can be represented,
however, by a hyperplane using a finite number of real-valued parameters.
We assume that there is a probability distribution D defined on the sample space X.
There are then two ways to obtain examples. First, the learner can draw examples from
X according to probability distribution D. Here we say that the learner is learning from
random examples. In this case, the learner has no choice over what examples it obtains.
Second, the learner may also, in some cases, make a membership query arbitrarily in X,
in which case the learner is also learning from membership queries. Making a membership
query is also called choosing an example. With a membership query, the learner can ask
about any point x, even if x is in a set to which D assigns zero probability.
We assume that the learner is told c*(x) for any example x obtained, so that the learner
knows whether x is a positive example or a negative example of the target concept. We say
that the learner sees examples in a set S if it either draws a random example from S or
makes a membership query in S.
A concept class C is the set of all concepts of a particular type. For instance, the set of all
half-spaces in R^n (each half-space defined by an (n-1)-dimensional hyperplane) is a concept
class. Another concept class is the class I = {[0, a] : 0 ≤ a ≤ 1} of initial segments of [0, 1].
We assume that the learner knows a priori the particular concept class C of which the target
concept c* is a member. However, it is not a requirement that the hypothesis concept that
the learner outputs come from this same concept class C. In fact, our proofs do not depend
on the assumption that the learner outputs some member of C as an approximation to c*.
Given two concepts c and c', let c ⊕ c' denote the set (c - c') ∪ (c' - c) of points that
are classified differently by c and c'. This set of points on which c and c' disagree is called
the symmetric difference of c and c'. Given two concepts c and c', we are not interested in
simply the size of the symmetric difference. Rather, we are interested in the probability of
the symmetric difference.
More formally, we define the distance d_D between concepts c and c' (with respect to
probability distribution D) as

    d_D(c, c') = D(c ⊕ c') ,

the probability that c(x) ≠ c'(x) for a point x randomly drawn according to D. That is,
d_D(c, c') is the probability that c and c' disagree on the classification of a randomly drawn
point. Another way to think about this is that d_D(c, c') measures the probability associated
with the symmetric difference of c and c'.
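To make the distance concrete, here is a small illustrative sketch of our own (not from the thesis) for the class I of initial segments under the uniform distribution on [0, 1]: the symmetric difference of [0, a] and [0, b] is the interval between a and b, so d_D = |a - b|, which a Monte Carlo estimate confirms.

```python
import random

def d_uniform(a, b):
    """d_D between initial segments [0,a] and [0,b] under the uniform
    distribution on [0,1]: the symmetric difference is the interval
    between a and b, whose uniform probability is |a - b|."""
    return abs(a - b)

def d_monte_carlo(a, b, n=200_000, seed=0):
    """Estimate d_D empirically: draw points according to D and count
    how often the two concepts disagree on the label."""
    rng = random.Random(seed)
    disagree = 0
    for _ in range(n):
        x = rng.random()              # x drawn according to D (uniform)
        if (x <= a) != (x <= b):      # [0,a] and [0,b] label x differently
            disagree += 1
    return disagree / n

print(d_uniform(0.3, 0.5))                        # 0.2 exactly
print(abs(d_monte_carlo(0.3, 0.5) - 0.2) < 0.01)  # True: estimate agrees
```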
Now, given a target concept c* and another concept c, we wish to measure the distance
d_D(c, c*) between the target concept and c. In this case, distance represents the error (again
in a probabilistic sense) of concept c with respect to target concept c*. More formally, we
define the error rate of a concept c (with respect to the distribution D and the target
concept c*) to be d_D(c, c*). Thus, the error rate of a concept c measures the probability
associated with the symmetric difference of c* and c. We remark that potentially the
symmetric difference of the target concept and c can be a very large set and yet the error
rate d_D(c, c*) can be very small, provided that there is little or no probability associated
with the symmetric difference of c* and c.
Under some probability distributions, it is in fact possible that the symmetric difference
of two concepts c and c' is non-empty, and yet has zero probability. In other words, although
the two concepts do disagree on some set of points, there is no probability weight on these
points. In this case, although the two concepts are not the same (that is, they do not classify
all points the same), the error rate d_D(c, c') equals zero. Thus we note that in general d_D
defines only a bounded pseudometric on C, for any probability distribution D. For d_D to
be a proper metric, it must be the case that for any incorrect concept c there is a positive
probability d_D(c, c*) of drawing a random example that demonstrates that c is incorrect.
If we were to consider only those probability distributions such that c ≠ c' implies that
d_D(c, c') > 0, then d_D would be a bounded metric on C. (In this thesis we do not state the
conditions on measurability that may be required to make our claims go through. See, for
example, Ben-David et al. [6] and Blumer et al. [9] for discussion.)
Often, the hypothesis space of a concept class is uncountably infinite. For instance, the
concept class I has an infinite hypothesis space: there are infinitely many initial segments
[0, a] of [0, 1]. The notion of the VC dimension of a hypothesis space offers a way of
measuring such an infinite space. More specifically, the VC dimension is a combinatorial
parameter of the hypothesis space. Haussler gives a very clear definition of VC dimension
[13]:
"The VC dimension of a hypothesis space H, denoted VCdim(H), is defined
to be the maximum number d of instances that can be labeled as positive and
negative examples in all 2^d possible ways, such that each labeling is consistent
with some hypothesis in H [10] [19]."
In other words, the VC dimension is equal to the largest number m of points, such that for
each element of the power set of the m points, the following holds: there exists a hypothesis
in the concept class, such that each of the points in this particular element of the power set
is classified as positive, and the remaining points (from the original set of m points) which
are not in this element of the power set are classified as negative. For example, the VC
dimension of the class I is 1.
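As a concrete check of this definition, the following sketch of ours (not from the thesis) brute-forces the shattering condition for the class I: a single point can be labeled in all 2 ways by some initial segment, but no pair of points can be, so VCdim(I) = 1.

```python
from itertools import product

def consistent_with_I(points, labels):
    """Is some initial segment [0,a] consistent with this labeling?
    [0,a] labels x positive iff x <= a, so a suitable threshold exists
    iff every positive point lies below every negative point."""
    pos = [x for x, lab in zip(points, labels) if lab]
    neg = [x for x, lab in zip(points, labels) if not lab]
    return not pos or not neg or max(pos) < min(neg)

def shattered(points):
    """True iff the class I realizes all 2^m labelings of the points."""
    return all(consistent_with_I(points, labels)
               for labels in product([0, 1], repeat=len(points)))

print(shattered([0.4]))        # True: a single point is shattered
print(shattered([0.2, 0.7]))   # False: no [0,a] labels 0.2 neg and 0.7 pos
```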
1.3 Pac-learning Definition
The core idea of pac-learning is that a learning algorithm should compute with high prob-
ability a hypothesis concept that is accurate to a given error rate. That is, the learner's
goal in the pac-learning model is not to exactly identify the target concept, but rather to
compute a hypothesis concept such that with high probability the hypothesis concept agrees
with the target concept on almost all examples in the sample space. Thus we have Valiant's
notion of probably approximately correct [13] [18].
In the case of "batch" learning (used in the models of this thesis), the learner first sees
all examples, and then computes the hypothesis concept. In the case of "online" learning,
the learner may recompute a hypothesis concept after each example seen. In either case, it
is not sufficient for an algorithm simply to compute and output a hypothesis consistent or
close to consistent with all examples seen thus far; the learner should have confidence that
with high probability almost all future examples will be classified correctly.
1.3.1 Pac-learning error rate
A pac-learning algorithm is given as part of its input an error parameter ε, which specifies
the maximum error rate acceptable (this relates to being approximately correct), and a
confidence parameter δ, which specifies the degree of certainty that the hypothesis will
classify all unseen examples to within the specified error rate ε (this relates to being probably
approximately correct).
More formally, let c represent the hypothesis concept computed by a run of a pac-learning
algorithm. To pac-learn successfully, the algorithm, given input parameters ε and δ, must,
for all probability distributions D, with probability at least 1 - δ compute a hypothesis c
such that the error rate d_D(c*, c) < ε:

    Pr(d_D(c*, c) < ε) ≥ 1 - δ .    (1.1)

In Valiant's original definition, d_D(c*, c) may be less than or equal to ε. For the sake of
simplicity, we assume that the error rate must be strictly less than ε.
We can now define computational complexity in pac-learning.
1.3.2 Pac-learning complexity
Valiant's goal was to encourage computer scientists to study complexity in learning theory
via "precise computational models of the learning phenomenon"[18]. Valiant delineates
two components of such learning models, which describe frameworks for learning target
concepts: (1) a protocol for receiving information about the target concept, and (2) an
algorithm for computing a hypothesis concept based on this information.
We define sample complexity to be the number of examples the learner needs to see
to compute a hypothesis concept. Then complexity in learning theory, as proposed by
Valiant, is measured as a function both of sample complexity (related to (1) above), and of
computational, or time, complexity, that is, how much computing a learning algorithm must
do to output a good hypothesis concept (related to (2) above). The goal in pac-learning is
to use only polynomial sample and time complexity to learn a target concept. This thesis
focuses in particular on issues related to sample complexity.
For a pac-learning algorithm to have polynomial sample complexity, it must have sample
complexity polynomial in 1/ε, 1/δ, and the length of the input. For a pac-learning algorithm
to have polynomial time complexity, it must take time polynomial in the number of examples
received to compute a good hypothesis.
In sum, in order for a pac-learning algorithm to be computationally efficient as defined
by the pac-learning model, two conditions need to hold:
1. The number of examples obtained, that is, the sample complexity, should be polyno-
mial in 1/ε, 1/δ, and the length of the input.
2. The time taken by the algorithm to compute a good hypothesis, that is, the compu-
tational complexity, should be polynomial in the number of examples received.
Henceforth, when we refer to pac-learning, we will be referring to computationally effi-
cient pac-learning, unless otherwise noted.
We make an assumption that ε is sufficiently small. It is always true that

    1/ε > -1/ln(1 - ε) .

For ε < .39841, however, it is also true that

    1/(4ε) < -1/ln(1 - 2ε) .

We thus assume ε is less than .39841; this assumption allows us to express our non-
asymptotic lower bounds in the more standard form as a function of 1/ε. For instance,
in Theorem 2 we substitute the weaker lower bound m > (1/4ε) ln(1/δ) for the stronger
m > (-1/ln(1 - 2ε)) ln(1/δ), where m is the number of examples obtained.
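The inequality behind this substitution is easy to verify numerically; the following check is our own illustration, not part of the thesis.

```python
import math

def bounds(eps):
    """The weaker lower bound 1/(4ε) and the stronger -1/ln(1 - 2ε)."""
    return 1 / (4 * eps), -1 / math.log(1 - 2 * eps)

# On the assumed range ε < .39841, the weaker bound never exceeds
# the stronger one, so the substitution is valid.
for eps in [0.01, 0.1, 0.2, 0.3, 0.39]:
    weak, strong = bounds(eps)
    assert weak < strong

# Near the threshold ε = .39841 the two bounds coincide.
weak, strong = bounds(0.39841)
print(round(weak, 4), round(strong, 4))  # both round to 0.6275
```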
1.3.3 Protocols for obtaining examples
Given the pac-learning model, there are numerous ways in which the learner can obtain
information about the target concept.
In the most traditional form of pac-learning (used, for example, by Blumer et al. [8]
and Valiant [18]), the protocol for receiving information is for the learner to draw random
examples according to a fixed but unknown probability distribution on the sample space.
(By unknown probability distribution, we mean unknown to the learner.) When the learner
requests a sample point, the learner receives back an example drawn according to this
unknown probability distribution. The example is labeled as either a positive or negative
example of the concept. Any probability distribution on the sample space is possible. Since
the complexity bounds must hold no matter what the unknown probability distribution is,
this type of learning is also called distribution-free learning.
A second protocol is to obtain information through queries. A model in which the
learner asks for information is called learning using queries. Although touched upon by
Valiant [18], learning using queries received little attention until Angluin published Types
of Queries For Concept Learning [1] in 1986. In her paper, Angluin defines a wide variety
of protocols for receiving information about the target concept. The protocols all have in
common the fact that they allow the learner more control over what examples are seen. For
instance, the learner can choose a particular example and ask whether it is in the target
concept; this is called making a membership query. In general, the goal in learning by query,
in contrast to pac-learning, is to achieve exact learning, that is, to find the exact target
concept.
This thesis looks at models in which information is obtained by both drawing and choos-
ing examples.
1.4 New Definitions
1.4.1 Dense-in-itself concept classes
We are interested in the sample complexity necessary to pac-learn what we call a dense-in-
itself concept class C.
Definition 1 A concept class C is called dense in itself if for all concepts c ∈ C, for all
γ > 0, and for all finite measures μ on X, there is another concept c' ∈ C (that is, c' ≠ c)
such that μ(c ⊕ c') < γ.
A measure is similar to a probability distribution, in that it assigns weight to a sample
space. However, in a probability distribution all the weight must sum to 1; in a finite
measure, the weight can sum to any finite value.
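As a minimal illustration of Definition 1 (ours, not the thesis's), take the class I with μ equal to Lebesgue measure on [0, 1]: for any [0, a] with 0 < a ≤ 1 and any γ > 0, moving the endpoint by less than γ yields a distinct concept whose symmetric difference with [0, a] has measure below γ.

```python
def nearby_segment(a, gamma):
    """Given [0,a] with 0 < a <= 1 and gamma > 0, return a' != a such
    that the Lebesgue measure of the symmetric difference of [0,a]
    and [0,a'] is |a - a'| < gamma (a dense-in-itself witness)."""
    delta = min(gamma, a) / 2   # stay inside [0,1] and strictly below gamma
    return a - delta

a, gamma = 0.5, 1e-6
a_prime = nearby_segment(a, gamma)
# A distinct concept whose symmetric difference has measure < gamma.
assert a_prime != a and abs(a - a_prime) < gamma
```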
The existence of dense-in-itself concept classes is easily proved using techniques by Ben-
David, Benedek, and Mansour in A Parametrization Scheme For Classifying Models of
Learnability [7]. They first define a notion Cov_D(A, B), such that one set of concepts A
covers another set of concepts B with respect to a given probability distribution D, if for
every ε > 0 and b ∈ B, there exists an element a ∈ A such that d_D(a, b) < ε. That is,
for every b ∈ B, we can always find an a ∈ A such that the probability of the symmetric
difference of a and b is less than ε. Note that as they define it, a and b may be the same
concept.
We modify their definition to handle the case when A and B are the same concept class.
In particular, we adapt their notation, and say that a set C is dense in itself if Cov'_D(C, C),
where Cov'_D(C, C) means that for every ε > 0 and c ∈ C, there exists an element c' ∈ C,
c ≠ c', such that d_D(c, c') < ε. Observe here the extra specification that the two concepts c
and c' cannot be the same. In sum, to make the analogy from their definition of Cov_D(A, B)
to our definition Cov'_D(C, C) of dense-in-itself, we add two conditions: the concept classes
A and B must be the same, and the two concepts a and b must be different.
Ben-David et al. also define the notion Cov_S(A, B). By definition, Cov_S(A, B) holds if
"for every b ∈ B there exists a sequence (a_i : i ∈ N) of elements of A such that lim_{i→∞} a_i =
b" [7]. That is, for every point x, there are only a finite number of concepts in the infinite
a_i sequence that disagree with b on their classification of x. Observe that the notion
Cov_S(A, B) depends solely on the nature of the sets A and B and is independent of any
probability distribution.
We define an analogous notion Cov'_S(C, C), similar to Cov_S(A, B), except again for the
two conditions that the concept classes A and B must be the same, and that each of the
concepts a_i must differ from b.
The notion Cov'_S(C, C) applies to the concept classes that we consider. For instance,
for the concept class of rectangles in the plane, we can think of a sequence of rectangles,
each one containing the one before it, that converges to the target rectangle b. Similarly,
we can think of a sequence of hyperplanes that are all parallel to and increasingly close to
a target hyperplane (where close is measured by perpendicular distance to the hyperplane).
Ben-David et al. prove that Cov_S(A, B) implies Cov_D(A, B). Their proof, essentially
without modifications, also shows that Cov'_S(C, C) implies Cov'_D(C, C). Thus, since the
concept classes we consider have the property Cov'_S(C, C), the concept classes we con-
sider have the property Cov'_D(C, C). Recalling that if a concept class has the property
Cov'_D(C, C), then the class is dense in itself, we have that the concept classes we consider
are dense in themselves. Some examples of concept classes that are dense in themselves are
half-spaces in n-dimensional space, rectangles in the plane, and the concept class I.
Recall that, by definition, a dense-in-itself concept class is guaranteed to contain, for
any finite measure μ and concept c, a concept c' ∈ C, c' ≠ c, such that the measure μ of the
symmetric difference of c and c' is arbitrarily small. Now the error rate, a probability
measure defined in Section 1.2, is a valid candidate for the measure μ above, since error rate
is a finite measure. Thus, substituting in the above definition error rate for μ and target
concept c* for c, we have: for any given dense-in-itself concept class C and arbitrary target
concept c*, we are guaranteed that there exists a concept c' with error rate d_D(c*, c') of
arbitrarily small size. For example, if we are given a target concept c* and wish to ensure
that the error rate of c* and some other concept is less than 10^-6, a dense-in-itself concept
class guarantees us that there exists a concept c' such that d_D(c*, c') < 10^-6.
The following lemma provides a key property of a dense-in-itself concept class.
Lemma 1 For every algorithm A that pac-learns a dense-in-itself concept class C using
random examples and membership queries, and every ε, δ, target concept c* in C, β > 0,
and probability distribution D, there exists a concept c in C - {c*}, such that

    Pr(A(ε, δ) sees examples in c* ⊕ c) < β .
Proof: Let S = c* ⊕ c. Assume A runs under probability distribution D. We prove the
lemma using Definition 1. Our strategy is to show that the probability

    Pr(A(ε, δ) sees examples in c* ⊕ c)

is bounded above by the expected number of points that algorithm A(ε, δ)
sees in S. Then, since by the definition of dense in itself we can make the expected number
of points that algorithm A(ε, δ) sees in S as small as we like, it follows
that we can make Pr(A(ε, δ) sees examples in c* ⊕ c) as small as we like too.
We define μ(S) to be the expected number of points that algorithm A(ε, δ) sees in S:

    μ(S) = E(number of points A(ε, δ) sees in S) .    (1.2)

Then μ is a measure defined on X. Let P_i equal the probability that A(ε, δ) sees i examples
in S. Then

    Pr(A(ε, δ) sees any examples in S) = Σ_{i=1}^∞ P_i ≤ Σ_{i=1}^∞ i·P_i = μ(S) .

Since by the definition of dense in itself we can find a concept c that allows us to make
μ(S) as small as we like (where S varies according to which c is chosen), the lemma follows.
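The key step of the proof, that the probability of seeing at least one example in S is at most the expected number of examples seen in S, can be checked numerically. This is our own illustrative sketch with a made-up distribution over the number of examples seen.

```python
def check(P):
    """P[i] = probability that the algorithm sees exactly i examples in S.
    Key step of the proof:
        Pr(at least one) = sum_{i>=1} P_i <= sum_{i>=1} i*P_i = mu(S)."""
    pr_any = sum(P[1:])                       # Pr of seeing any example in S
    mu = sum(i * p for i, p in enumerate(P))  # expected number seen in S
    return pr_any, mu

# A made-up distribution over the number of examples seen in S.
pr_any, mu = check([0.5, 0.3, 0.15, 0.05])
assert pr_any <= mu
print(round(pr_any, 3), round(mu, 3))   # 0.5 0.75
```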
1.4.2 Smooth probability distributions
In the original definition of the pac-learning model, all probability distributions are possi-
ble. One can speculate that the fact that learning must be done under such a model drives
up sample complexity. For instance, the fact that the fixed but unknown probability dis-
tribution could be extremely skewed may force the algorithm to obtain far more examples
than it might have to obtain if it had some information ahead of time about the probability
distribution.
We therefore consider whether having some advance knowledge about the nature of the
probability distribution might, in fact, lower the number of samples the algorithm needs to
obtain. Thus we look at the sample complexity necessary to pac-learn a concept class C
under a smooth probability distribution.
Definition 2 A probability distribution defined on R^n is called smooth if there exists an
s > 0, such that for all a > 0, the probability associated with a region of volume a is ≤ sa.
In other words, for every smooth probability distribution, there is an upper bound on
the amount of probability that can be associated with any region, and this upper bound is
in direct proportion to the size of the region. For instance, for the concept class I, the value
sa is an upper bound on the probability of a region of length a. Similarly, in n-dimensional
space, the value sa is an upper bound on the probability of a region of volume a. Observe
that under a smooth probability assumption, certain highly irregular distributions are not
possible. For instance, there is a limit to how much probability weight can be found in a
very small region, and there can be no probability weight assigned to a point.
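For a concrete instance (ours, not the thesis's): the density f(t) = 2t on [0, 1] never exceeds 2, so the distribution it induces is smooth with s = 2, and every interval of length a carries probability at most 2a.

```python
def interval_prob(x, a):
    """Probability of the interval [x, x+a] under the density f(t) = 2t
    on [0,1], whose CDF is t^2:  (x+a)^2 - x^2."""
    return (x + a) ** 2 - x ** 2

s = 2.0   # the density never exceeds 2, so the distribution is s-smooth
for x in [0.0, 0.3, 0.6]:
    for a in [0.05, 0.2, 0.4]:
        if x + a <= 1:
            # probability of a region of length a is at most s*a
            assert interval_prob(x, a) <= s * a
print("smoothness bound holds")
```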
1.5 Pac-learning Background
Some pac-learning history will help the reader view our results in light of past work.
We recall that pac-learning imposes two complexity requirements: (1) the sample com-
plexity, which is the number of samples drawn, must be polynomial in the length of an
input instance and in the inverse of the error and confidence parameters, and (2) the com-
putational complexity, which is the computational time to find a good hypothesis (with
good defined using the ε and δ parameters), must be polynomial in the number of examples
received.
The first classes shown to be pac-learnable under the distribution-free model were all
classes of Boolean functions [18]. These classes, such as k-DNF and k-CNF, all have finite
sample space, for example, {0, 1}^n. In 1986, Blumer et al. extended the notion of what
classes could be learned under the distribution-free pac-learning model to include classes
in Euclidean n-dimensional space, such as half-spaces and hypercubes [8]. Significantly,
concept classes in Euclidean n-dimensional space, in contrast to classes of Boolean functions,
have uncountably infinite sample space: the sample space is R n . (Observe, however, that a
representation of the concept, for instance, a defining hyperplane, can be specified finitely
with a finite number of real-valued parameters.) Blumer et al. gave upper and lower
polynomial bounds on sample complexity for these classes, and described polynomial-time
algorithms to learn certain Euclidean n-dimensional classes [8].
In fact, a wide variety of concept classes have been shown to be learnable with both
polynomial sample and time complexity. Examples of such classes include monomials over
n Boolean variables [18], and, for constant k, k-DNF [18], k-CNF [18], and k-decision-lists
[15] [17].
To a large extent, however, the time and sample complexity requirements are separable.
In 1986 Blumer et al. [8] showed that a concept class has polynomial sample complexity if
the VC dimension of the concept class family grows polynomially, and that if the VC
dimension of the family grows faster than polynomially, then the concept
class is not pac-learnable [8].
There have also been significant results in the area of computational complexity. Two
of the results mentioned here are based on differing definitions of the pac-learning model:
one in which the hypothesis concept must come from the same concept class as the target
concept, and one in which the hypothesis concept may come from a different class than the
target concept.
In 1986, Pitt and Valiant [16] showed that given a pac-learning model in which the
output hypothesis concept must come from the same concept class as the target concept,
the computational complexity to pac-learn certain concept classes is not polynomial. They
proved that if finding a consistent hypothesis from the same class as the target concept
class is NP-hard, then the given concept class is not pac-learnable unless RP = NP. For
instance, they showed that since finding a consistent 2-term DNF hypothesis for the 2-term
DNF concept class is NP-hard, 2-term DNF is not pac-learnable, assuming RP ≠ NP.
Kearns and Valiant proved computational complexity results for the pac-learning model
in which the hypothesis concept may come from a different concept class than the target
concept. They showed that even if the output hypothesis can come from a class outside
the target concept class, the computational complexity to pac-learn certain concept classes
is not polynomial. This result depends on certain complexity-theoretic cryptographic as-
sumptions. Thus Kearns and Valiant showed that certain classes have polynomial sample
complexity but not polynomial time complexity. Examples of such classes are Boolean
formulas and finite automata [14].
Even when a class is pac-learnable, however, the traditional distribution-free model
of pac-learning places certain restrictions on the ways in which a learner can learn, and
these restrictions potentially drive up the sample complexity. For instance, the learner is
restricted to receiving examples at random according to a fixed but unknown probability
distribution: the learner has no choice over the particular examples seen. Additionally, the
upper bounds for sample complexity must hold no matter what probability distribution the
examples are drawn under: the bounds are worst case and so must hold even for extremely
unusual distributions. These limitations in the traditional pac-learning model have caused
researchers to look at ways of varying the model.
One of the principal means of varying the model is to allow the learner to ask queries, or
questions, of various sorts. Angluin and others have shown that numerous concept classes
that are not pac-learnable using the traditional model of drawing examples can be exactly
learned in polynomial time under various query models. An example of such a class is the
class of regular languages [1] [2]. Examples of classes that the membership-query model
in particular can be used to learn are monotone DNF, monotone CNF [3] [18], read-once
Boolean formulas [4], and the union of two half-spaces [5]. In sum, the extra power inherent
in the query model has proven sufficient to learn classes that are not pac-learnable under
the random example model.
1.6 The Results of this Thesis
This thesis examines two variations on the standard distribution-free model.
The first way we vary the model is to allow the learner to choose examples as well as
draw random examples. Here, we combine the traditional pac-learning protocol of drawing
random examples with the membership-query protocol of choosing examples to propose new
models for learning using queries.
The second way we vary the model is to give the learner some advance knowledge about
the probability distribution. We propose a model in which the learner knows a priori that
the probability distribution is smooth. Since a smooth probability distribution is one that
guarantees a specified upper bound on the amount of probability that can be associated
with any region, certain distributions, such as extremely skewed distributions, are ruled
out.
There are two key ways our new models differ from other query models. First, research
on learning using queries has focused on learning classes that do not have polynomial com-
putational complexity, and so are too hard to learn under the pac-model. This thesis looks
at classes already known to be pac-learnable, and so which are already known to have
polynomial sample complexity. The issue addressed then is not whether the given concept
class has polynomial sample complexity, but rather whether using membership queries in
conjunction with random examples can reduce the sample complexity of the concept class.
Second, most research on query models has focused on exactly identifying target concepts
from classes with finite sample space. This thesis, however, looks at pac-learning classes
which may have uncountably infinite sample space. So whereas membership queries are
traditionally used to help exactly identify a concept from a finite class of concepts, we ask
whether membership queries can help pac-learn (that is, approximately rather than exactly
learn) a potentially infinite, even uncountable, concept class.
In Chapter 2, we show that under the distribution-free pac-learning model, the power to
both choose and draw random examples cannot significantly reduce sample complexity. In
contrast, in Chapter 3, we show that under a model in which the learner has some limited
information about the nature of the probability distribution, the power to both choose and
draw examples does lead to a significant reduction in sample complexity. In Chapter 4, we
state conclusions and open problems.
One of our main results thus is that for traditional pac-learning models, membership
queries do not help: the sample size required to pac-learn a dense-in-itself concept class if
the learner can both choose and draw examples is within a constant factor of the sample size
needed if the learner can just draw examples. Significantly, this lower bound is established
for all probability distributions (as long as the probability distribution is unknown to the
learner), not just for one worst-case probability distribution. On the other hand, we get
significant results in favor of membership queries if the learner knows a priori that the
probability distribution is smooth: if the learner can only draw examples, then the lower
bound on sample complexity for pac-learning many concept classes does not improve, but
if the learner can both choose and draw examples, then the sample and computational
complexity can be reduced significantly.
Chapter 2
A Lower Bound When Both Drawing and
Choosing Examples
Up until now, when we have referred to pac-learning, we have meant the traditional model
of pac-learning, in which the protocol for obtaining information about the sample space is
to draw random examples. Henceforth, we consider any number of protocols to be viable
ways of obtaining examples when pac-learning. In particular, we consider protocols that use
random examples alone, membership queries alone, or a combination of both. We always
specify the protocol being used.
2.1 A Lower Bound
Theorem 1 gives a lower bound on the number of examples required to pac-learn a dense-
in-itself concept class C by any algorithm that can both draw random examples and make
membership queries.
Theorem 1 For every algorithm A that pac-learns a dense-in-itself concept class C using
random examples and membership queries, and for every ε, δ > 0, probability distribution
D, and target concept c* ∈ C, the total number of examples A needs to pac-learn C is
Ω((1/ε)ln(1/δ)).
Note that A may be either randomized or deterministic, and it may draw examples
and make membership queries in any order. The bound is on the worst-case number of
examples seen, where the worst case is taken over the runs of the algorithm for the given
target concept, probability distribution, ε, and δ. We assume in our asymptotic notation
that ε, δ, and β (to be defined shortly) are going to zero in the limit.
Proof: We begin by assuming that a pac-learning algorithm A, an ε, a δ, a probability
distribution D, and a target concept c* have been given. We claim that we, as adversary,
can produce a probability distribution D' and a concept c'* that are close enough to D and
c*, respectively, that if A does not draw Ω((1/ε)ln(1/δ)) random examples to distinguish
these cases, then Pr(d_D'(c'*, ĉ) ≥ ε) > δ, so that A is not a pac-learning algorithm.
Let β be an arbitrarily small positive number, and let c'* be a concept different from c*
such that

Pr(A sees examples in c* ⊕ c'* | A draws according to D) < β .

The existence of c'* is guaranteed by Lemma 1. Then

Pr(A sees no examples in c* ⊕ c'* | A draws according to D) ≥ 1 − β .

Note that if A sees no examples in c* ⊕ c'*, then A cannot distinguish between target concept
c* and target concept c'*.
We now choose any point w in c* ⊕ c'*. Define the probability distribution D_w to be the
distribution that assigns probability 1 to point w and probability 0 to all other points.
Given probability distributions D and D_w, we make a new probability distribution D' as
follows. We "skim" ε probability off of D, while keeping the relative distribution of points
in D the same. That is, we uniformly remove ε probability from the probability distribution
D such that the relative weight of points under D remains the same, but the total weight
of points in D now adds up to 1 − ε. We assign the excess ε probability to w. Thus our
new probability distribution D' is a linear combination of probability distribution D and
probability distribution D_w:

D' = (1 − ε)D + εD_w .
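The construction of D' is easy to picture as a sampler. The following sketch is ours, not the thesis's; the helper name `make_D_prime` and the choice of a uniform D are illustrative assumptions:

```python
import random

def make_D_prime(draw_from_D, w, eps, rng=random):
    """Return a sampler for D' = (1 - eps)*D + eps*D_w, where D_w puts
    all of its probability on the single point w."""
    def draw():
        # With probability eps the example is drawn "as if according
        # to D_w" (it is just w); otherwise "as if according to D".
        if rng.random() < eps:
            return w
        return draw_from_D()
    return draw

# Illustration: D uniform on [0, 1], w = 0.5, eps = 0.1.
rng = random.Random(0)
draw_D_prime = make_D_prime(lambda: rng.random(), 0.5, 0.1, rng)
samples = [draw_D_prime() for _ in range(100_000)]
print(sum(1 for x in samples if x == 0.5) / len(samples))  # close to eps
```

A run of A that never draws the point w is exactly a run in which every example was drawn "as if according to D".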
From now on, we suppose that algorithm A is drawing its random examples according
to D'. When drawing an example according to D', there are two possibilities: either the
example is drawn "as if according to D" (this happens with probability 1 − ε), or the example
is drawn "as if according to D_w" (this happens with probability ε). In general, we say that
"A draws as if according to D" if A draws all of its random examples as if according to D.
Suppose that we run algorithm A on distribution D' and target concept c*, that A
draws all of its examples as if according to distribution D, and that A sees no examples
in c* ⊕ c'*. By definition, a pac-learning algorithm with high probability correctly classifies
any point with weight ε or greater. Because A is a pac-learning algorithm and w has weight
ε, A's output hypothesis ĉ classifies w correctly with high probability. Thus ĉ with high
probability agrees with c*(w), and so disagrees with c'*(w).
We note that, on the one hand, c'* is consistent with target concept c* on all examples
algorithm A sees with probability at least 1 − β, yet, on the other hand, c'* disagrees with
c* on the classification of w. Since A is a pac-learning algorithm, therefore, when A now
tries to learn c'* instead of c* on distribution D', we have that its output concept ĉ satisfies

Pr(ĉ(w) ≠ c'*(w) | A draws as if according to D) ≥ 1 − δ − β , (2.1)

since we are assuming that A pac-learns c* on D' correctly, and with probability 1 − β the
run looks the same as it would if c* were the target concept. Observe that if A runs under
probability distribution D', then saying ĉ(w) ≠ c'*(w) is equivalent to saying d_D'(ĉ, c'*) ≥ ε.
Let m = m(ε, δ) denote the maximum possible total number of examples drawn and/or
chosen by any run of A on inputs ε and δ, over any probability distribution and any target
concept. Since random examples are independent events, if all m examples obtained are
random examples, then

Pr(A draws as if according to D) = (1 − ε)^m .

If fewer than m examples are random examples, then

Pr(A draws as if according to D) ≥ (1 − ε)^m .
Assume that m is so small that it satisfies

(1 − ε)^m ≥ δ/(1 − δ − β) ;

then

Pr(A draws as if according to D) ≥ δ/(1 − δ − β) .

We show that this assumption implies that A does not pac-learn.
Assume that in a given run A receives information about m sample points, by random
example according to probability distribution D', by membership query, or by both. What
then is the probability that during a run of A (on input distribution D', with target concept
c'*) the following events both happen?

E = A misclassifies w (and so makes error at least ε).
F = A draws all its random examples as if according to D.

Now,

Pr(EF) = Pr(E | F) Pr(F) .

Thus, since (by (2.1))

Pr(E | F) = Pr(A misclassifies w | A draws as if according to D) ≥ 1 − δ − β ,
and

Pr(F) = Pr(A draws as if according to D) ≥ (1 − ε)^m ≥ δ/(1 − δ − β) ,

we have

Pr(EF) ≥ (1 − δ − β) · δ/(1 − δ − β) = δ .

Since Pr(E) ≥ Pr(EF) ≥ δ, we have Pr(d_D'(c'*, ĉ) ≥ ε) ≥ δ, and therefore A has not
pac-learned.
Solving for m in (1 − ε)^m ≥ δ/(1 − δ − β), we have

m ln(1 − ε) ≥ ln δ − ln(1 − δ − β) ,

so

m ≤ (1/ln(1/(1 − ε))) · (ln(1/δ) + ln(1 − δ − β)) .

Thus, unless

m > (1/ln(1/(1 − ε))) · (ln(1/δ) + ln(1 − δ − β)) ,

in other words, unless m = Ω((1/ε)ln(1/δ)), we have Pr(d_D'(ĉ, c'*) ≥ ε) ≥ δ, which proves
our theorem. ∎
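As a numerical illustration (ours, not part of the proof), the exact threshold on m implied by (1 − ε)^m ≥ δ/(1 − δ − β) can be computed directly and compared with (1/ε)ln(1/δ); the ratio stays within a constant factor as ε shrinks:

```python
import math

def m_threshold(eps, delta, beta):
    """Largest m with (1 - eps)**m >= delta/(1 - delta - beta): below
    this many examples, the adversary's bad event still has probability
    at least delta, so A cannot yet be pac-learning."""
    return math.log((1 - delta - beta) / delta) / math.log(1 / (1 - eps))

delta, beta = 0.05, 1e-6
for eps in (0.1, 0.01, 0.001):
    m_star = m_threshold(eps, delta, beta)
    ratio = m_star / ((1 / eps) * math.log(1 / delta))
    print(f"eps={eps}: m threshold ~ {m_star:.1f}, "
          f"ratio to (1/eps)ln(1/delta) = {ratio:.3f}")
```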
2.2 Comparison of the Lower Bound to Previous Results
We already know from Ehrenfeucht et al. [12] that for ε < 1/2, an algorithm that can only
draw random examples must see at least

Ω((1/ε)ln(1/δ) + VCdim(C)/ε)

examples to pac-learn, where VCdim(C) is the Vapnik-Chervonenkis dimension of the class
C. Since our bound is within a constant factor of their lower bound (treating VCdim(C)
as a constant as we let ε and δ go to zero), we have shown that the minimum number
of examples needed by an algorithm A that can both draw random examples and make
membership queries is within a constant factor of the minimum number of examples needed
by an algorithm B that can only draw random examples.
We remark that, compared to the proof given by Ehrenfeucht, Haussler, Kearns, and
Valiant [12], our proof has the advantage that it applies to every probability distribution (not
just one constructed by the adversary), with the understanding of course that the learning
algorithm has no prior knowledge about the actual probability distribution, and that the
learning algorithm must pac-learn for every distribution. It also applies to learning from
both random examples and membership queries (not just learning from random examples).
On the other hand, our proof does not yield a lower bound that grows with the VC dimension
of the concept class C.
Chapter 3
Smoothness Guarantees
Until now, we have assumed that any probability distribution was possible. Now suppose
we have a "smoothness guarantee." Specifically, we assume that the learning algorithm has
an additional input, s, and is told that any region of X having volume a, for any a, has
probability at most sa. Can this assumption improve our lower bound? Our first claim
is that if the learning algorithm uses random examples only, then a smoothness guarantee
does not reduce the lower bound on the number of random examples required to pac-learn.
We say (C, D) (where C is a concept class on a domain X and D is a probability
distribution on X) has all error rates if for every target concept c* ∈ C, and for every
nonnegative ε ≤ sup_{c∈C}{d_D(c*, c)}, there is a concept c ∈ C such that d_D(c*, c) = ε. For
instance, if D is smooth, and C is the concept class of half-spaces of R^n, rectangles in the
plane, or I, then (C, D) has all error rates.
3.1 A Lower Bound When Drawing Examples Only
Theorem 2 For every algorithm B that draws random examples to pac-learn C, sufficiently
small ε > 0, δ > 0, and smooth probability distribution D, if (C, D) has all error rates then
the number of examples B needs to draw is greater than (1/4ε)ln(1/δ).
Thus, even though an algorithm B has information that D is smooth, if C is a concept
class such that (C, D) has all error rates, B still needs to draw (1/4ε)ln(1/δ) examples to
pac-learn C. Note that while the assumption that a concept class C under distribution D
has all error rates may seem restrictive, as a practical matter it is quite reasonable, given
the assumption of a smooth probability distribution. We leave it as an open problem to see
whether this assumption actually is a restriction.
Proof:
Let c be a concept such that d_D(c*, c) = 2ε. (We have specified in the statement of Theorem 2
that ε must be sufficiently small. To be more precise, if ε < (1/2)·sup_{c∈C}{d_D(c*, c)},
then by the assumption that (C, D) has all error rates, there will always be a concept c with
error rate 2ε. Adding this constraint to our earlier constraint in Section 1.3.2 that ε must
be less than .39841, we have ε < min(.39841, (1/2)·sup_{c∈C}{d_D(c*, c)}).)
Let S = c* ⊕ c. As before, let ĉ be the hypothesis concept that a run of algorithm B
outputs, and let c'* be the concept that we as adversary choose.
We begin by showing that if B sees no points in S, an adversary can force B to have
error rate at least ε. We conclude by showing that unless B draws at least (1/4ε)ln(1/δ)
examples, then with probability greater than δ the algorithm will see no points in S; in
other words, with probability greater than δ the algorithm will have error rate at least ε.
Given that B sees no points in S, we assume without loss of generality that ĉ is more
likely to agree with c* (in the probabilistic sense) than with c on points in S. Then

Pr(ĉ(x) = c*(x) : x ∈ S) ≥ 1/2 ,

where the probability is based on the conditional distribution induced on S by distribution D.

In this case, we set c'* to c. (If ĉ is more likely to agree with c, we set c'* to c*; the
reasoning for the rest of the proof remains the same.) Thus

Pr(ĉ(x) ≠ c'*(x) : x ∈ S) ≥ 1/2 ,

where the probability again is based on the conditional distribution induced on S by
distribution D.
Since d_D is a bounded pseudometric, the triangle inequality holds for d_D, giving

d_D(c*, ĉ) + d_D(ĉ, c'*) ≥ d_D(c*, c'*) . (3.1)

Now, by assumption,

d_D(ĉ, c'*) ≥ d_D(c*, ĉ) . (3.2)

By (3.1) and (3.2), we have

2 d_D(ĉ, c'*) ≥ d_D(c*, c'*) , or d_D(ĉ, c'*) ≥ d_D(c*, c'*)/2 . (3.3)

Also, by assumption, we have

d_D(c*, c) = 2ε . (3.4)

Since we have set c'* to c, (3.4) is equivalent to

d_D(c*, c'*) = 2ε . (3.5)

Thus, combining (3.3) and (3.5), we obtain

d_D(ĉ, c'*) ≥ 2ε/2 = ε ,

which shows that if B sees no points in S, B has error rate at least ε.
which shows that if B sees no points in S, B has error rate at least e.
The probability that B independently draws m examples and fails to see a point in 5,
that is, the probability that B has error rate at least e, is (1 — 2e) m . We fail to pac-learn
if the probability of error rate at least e is greater than S, that is, if (1 - 2t) m > 6; given
our assumption that e < .39841, we thus have that the lower bound on m even with the
smoothness guarantee is (l/4e)ln(l/£). I
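The role of the constant .39841 can be checked numerically: it is, to five places, the point below which ln(1/(1 − 2ε)) ≤ 4ε, which is exactly what converts the exact threshold ln(1/δ)/ln(1/(1 − 2ε)) into the stated (1/4ε)ln(1/δ) form. A sketch (the helper name `excess` is ours):

```python
import math

def excess(eps):
    """ln(1/(1 - 2*eps)) - 4*eps.  While this is nonpositive, solving
    (1 - 2*eps)**m > delta for m gives a threshold of at least
    (1/(4*eps)) * ln(1/delta)."""
    return math.log(1 / (1 - 2 * eps)) - 4 * eps

# The sign change sits just below eps = .39841, which is where the
# constant in the theorem's constraint comes from.
print(excess(0.397), excess(0.40))

# For a typical smaller eps, the stated form of the bound holds:
eps, delta = 0.01, 0.05
exact = math.log(1 / delta) / math.log(1 / (1 - 2 * eps))
print(exact, (1 / (4 * eps)) * math.log(1 / delta))
```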
3.2 Upper and Lower Bounds When Both Drawing and Choosing Examples
So far in this chapter, we have shown that a smoothness guarantee does not significantly
reduce the lower bound on sample complexity if the learning algorithm can only draw
random examples. We now show, on the other hand, that if the learning algorithm can also
make membership queries, then there exists a fast pac-learning algorithm whose sample
and computational complexity upper bounds are well below the lower bounds proved in the
previous section. We first prove this result for a one-dimensional concept class, and then
extend the result to an n-dimensional concept class. (Our generalization to n dimensions is
not completely general; we prove the result only for a specific concept class. It is an open
problem to prove such a result for general bounded n-dimensional domains.)
3.2.1 Using the binary-search approach in one dimension
Theorem 3 There exists an algorithm A that pac-learns I using membership queries, such
that for every ε, δ, and smooth probability distribution D with known smoothness constant
s, the number of examples A needs to pac-learn I is lg(s/ε). For deterministic algorithms
lg(s/ε) is also a lower bound on the sample complexity. Additionally, if we assume unit cost
for each example drawn or chosen, a deterministic algorithm A runs in time O(lg(s/ε)).
Note that although under this model the algorithm may both draw and choose examples,
the algorithm we present only chooses examples.
Proof: The proof is based on a "binary search." We call a hypothesis interval an interval
defined by two points, which we call L and R. Here L is the value of the rightmost positive
example seen so far, and R is the value of the leftmost negative example seen so far. Thus
the interval [L,R] represents the points that we don't know how to classify yet; the right
endpoint of c*, our target concept, is somewhere within this interval.
In order to guarantee that the probability sa associated with a region of volume a is
less than ε, we set up the following equation, which gives us the requisite size for a:

sa < ε, or
a < ε/s.

So if A has narrowed [L, R] to a size less than ε/s, A is guaranteed that the probability
associated with [L, R] is less than (ε/s) · s, or ε, and therefore can output any ĉ with right
endpoint in [L, R] with total confidence (δ = 0) that the error rate is less than ε. Algorithm
A can use membership queries to do a binary search on the initial interval [0,1], in which
case A needs to make only lg(s/ε) membership queries in order to narrow [L, R] to a size
less than ε/s. Observe that this upper bound of lg(s/ε) on the sample complexity when
using membership queries is substantially lower than the lower bound of (1/4ε)ln(1/δ) on
the sample complexity when using random examples only.
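The binary search at the heart of the proof takes only a few lines. In this sketch (ours), the membership oracle and the target endpoint a are illustrative assumptions:

```python
import math

def learn_initial_segment(member, s, eps):
    """Learn a concept [0, a] from the class I with membership queries
    only, under an s-smooth distribution.  member(x) is True iff x <= a.
    Halving [L, R] until R - L < eps/s makes the unclassified region's
    probability less than (eps/s)*s = eps, so delta = 0."""
    L, R = 0.0, 1.0   # rightmost positive / leftmost negative point seen
    queries = 0
    while R - L >= eps / s:
        mid = (L + R) / 2
        queries += 1
        if member(mid):
            L = mid
        else:
            R = mid
    return L, queries  # any right endpoint in [L, R] is a valid hypothesis

# Illustration with an arbitrary target endpoint a (our assumption):
a = 0.3141
hyp, q = learn_initial_segment(lambda x: x <= a, s=4.0, eps=0.01)
print(hyp, q)
assert abs(hyp - a) < 0.01 / 4.0               # endpoint inside the final interval
assert q <= math.ceil(math.log2(4.0 / 0.01))   # about lg(s/eps) queries
```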
To see that lg(s/ε) is also a lower bound on the sample complexity for deterministic
algorithms, consider an adversarial argument. An adversary can always label the requested
examples so that after the algorithm makes lg(s/2ε) queries, an interval [L, R] of size 2ε/s
remains. At this point, the adversary can always force the algorithm to output a hypothesis
with error rate at least ε. For instance, even if the algorithm outputs the concept represented
by the midpoint of the interval, the adversary can choose the concept represented by L.
Then, because the interval between L and the midpoint has length ε/s, in the worst case
the hypothesis will have error rate s · ε/s = ε. Since lg(s/2ε) = lg(s/ε) − 1, we have that
greater than lg(s/ε) − 1 queries are needed to ensure that in all cases the error is less than
ε; thus, at least lg(s/ε) queries are necessary.

If we assume it takes constant time for each membership query, then we also achieve the
tight bound of O(lg(s/ε)) on the computational complexity for deterministic algorithms. ∎
3.2.2 Generalizing the binary-search approach to n dimensions
So far, we have given an algorithm which pac-learns the one-dimensional class I under
the assumption of a smooth probability distribution. In short, such an algorithm pac-learns
"half-spaces" of a one-dimensional line segment. We would also like to show how to pac-learn
an analogous 7i-dimensional class. That is, we would like to pac-learn the set of half-spaces
of an n-dimensional object, where the n-dimensional object is a logical generalization of the
one-dimensional line segment.
There are a number of objects that come to mind as logical generalizations of the one-
dimensional line segment. Two such generalizations are the hypercube and the hypersphere.
But perhaps the simplest generalization of a one-dimensional segment is the simplex. We
will generalize the binary-search approach of the above section to learn half-spaces of a
simplex.
A simplex is a finite region of n-space enclosed by n + 1 hyperplanes, where each
hyperplane has dimension n − 1. Importantly, a simplex has only n + 1 corner points, n + 1
faces, and (n+1 choose 2) edges. Each corner point is connected to each other corner point by an
edge. For instance, a simplex in one dimension is a line segment, which is enclosed by two points
(a point has dimension zero). A simplex in two dimensions is a triangle, which is enclosed
by three lines. A simplex in three dimensions is a tetrahedron, which is enclosed by four
planes, and so on [11]. A regular simplex is one in which each edge has the same length.
A simplex thus has certain properties that make it easier to use as the generalization of
a line segment than a hypercube or a hypersphere. For instance, observe that the number of
faces, corner points, and edges of a simplex is polynomial (at most quadratic) in the dimension.
This contrasts sharply with a hypercube, for example, in which the number of edges and corner
points is exponential in the dimension. Our algorithm capitalizes on, and in fact depends on,
these properties of the simplex. We leave it as an open problem to pac-learn the set of
hyperplanes which intersect a hypercube or a hypersphere.
Let n represent the dimension of the concept class.
Theorem 4 Let S be the class of hyperplanes which intersect a regular simplex. There
exists an algorithm A that pac-learns S using membership queries, such that for every ε, δ,
and smooth probability distribution D with known smoothness constant s, the number of
examples A needs to pac-learn S is n²(lg(s/ε) + 4). Additionally, if we assume unit cost for
each example drawn or chosen, A runs in time O(n^7 s/ε).
Proof:
We consider a regular simplex in which each edge has length 1, and we assume that the
coordinates of the simplex corners are given to the learning algorithm. The domain X of
S consists of exactly those points on the boundary of the simplex and within the simplex.
Given a target hyperplane that intersects a simplex, we consider the target hyperplane and
all domain points on one side of the hyperplane to be positive examples of the concept,
and all domain points on the other side of the hyperplane to be negative examples of the
concept. Thus concept class S is represented by the set of (n − 1)-dimensional hyperplanes
that intersect the given simplex of n dimensions.
We begin our analysis by giving the formulas for computing the height and volume of a
regular simplex whose edges have length 1.

The recurrence formula for the height of such an n-dimensional simplex is:

H_{n+1} = √(1 − (H_n · n/(n + 1))²) .

This evaluates to

H_n = √((n + 1)/2n) .

The volume V_n of an n-dimensional simplex of edge-length 1 is equal to the base of the
simplex (B_n) times the height of the simplex (H_n) divided by the dimension n:

V_n = B_n · H_n / n .

Since the base B_n of an n-dimensional simplex is equal to the volume V_{n−1} of the
corresponding (n − 1)-dimensional simplex, the recurrence formula for V_n is:

V_n = V_{n−1} · H_n / n .

This evaluates to

V_n = √((n + 1)/2^n) / n! .

The surface area S_n of an n-dimensional simplex is equal to the number of faces of an
n-dimensional simplex (n + 1) times the volume of an (n − 1)-dimensional simplex (V_{n−1}):

S_n = (n + 1) · V_{n−1} = (n + 1) · √(n/2^{n−1}) / (n − 1)! .
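These recurrences and closed forms are easy to sanity-check numerically. The following sketch (ours) verifies them against each other and against the familiar low-dimensional values:

```python
import math

def H(n):
    """Height of a regular n-simplex with unit edges: sqrt((n+1)/(2n))."""
    return math.sqrt((n + 1) / (2 * n))

def V(n):
    """Volume via the recurrence V_n = V_{n-1} * H_n / n, with V_0 = 1."""
    v = 1.0
    for k in range(1, n + 1):
        v *= H(k) / k
    return v

def V_closed(n):
    """Closed form V_n = sqrt((n+1)/2**n) / n!."""
    return math.sqrt((n + 1) / 2 ** n) / math.factorial(n)

def S(n):
    """Surface area S_n = (n + 1) * V_{n-1}."""
    return (n + 1) * V_closed(n - 1)

for n in range(1, 8):
    # height recurrence H_{n+1} = sqrt(1 - (H_n * n/(n+1))**2)
    assert math.isclose(H(n + 1), math.sqrt(1 - (H(n) * n / (n + 1)) ** 2))
    assert math.isclose(V(n), V_closed(n))

print(V(2), math.sqrt(3) / 4)        # area of a unit equilateral triangle
print(V(3), 1 / (6 * math.sqrt(2)))  # volume of a unit regular tetrahedron
```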
For any target hyperplane which cuts the simplex, the learning algorithm A should
output a hypothesis hyperplane such that the error rate of the hypothesis is less than ε.
Consider first the classification of each of the simplex corner points. In the trivial case
where all n + 1 corner points are labeled positive, all points inside the simplex are also
positive. In this case any hypothesis which classifies all simplex points as positive suffices.
(Similarly if all n + 1 corner points are labeled negative.) Otherwise, there is at least one
positive corner point and one negative corner point.
Let k be the total number of positive corner points, and let l be the total number of
negative corner points. Then k + l = n + 1, the total number of corner points. Since each
corner point is connected to each other corner point, the total number of "mixed" edges,
that is, edges such that one endpoint is positive and one is negative, is kl.

The separating target hyperplane must cut each of these kl mixed edges. In fact, the
set of simplex edges that the target hyperplane cuts is precisely this set of kl mixed edges.
(All other edges have both endpoints positive or both endpoints negative, and clearly the
separating hyperplane cannot cut these edges.)
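As a small illustration (ours, not from the thesis): enumerating the edges of a labeled simplex recovers exactly the kl mixed edges.

```python
from itertools import combinations

def mixed_edges(labels):
    """Given the +/- labels of the n + 1 corner points of a simplex
    (every pair of corners is an edge), return the mixed edges: the
    edges the separating target hyperplane must cut."""
    return [(i, j) for i, j in combinations(range(len(labels)), 2)
            if labels[i] != labels[j]]

labels = [True, True, False, True, False]  # n = 4: k = 3 positive, l = 2 negative
edges = mixed_edges(labels)
k = sum(labels)
l = len(labels) - k
print(len(edges), k * l)  # both equal 6
```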
By definition, the length of each edge of the simplex is 1. We wish to find a much smaller
interval, which we call a cut-interval, on each mixed edge. A cut-interval is a sub-interval of
a mixed edge that still retains the property that one endpoint is positive and one endpoint is
negative. Since a mixed edge is one dimensional, we can apply the binary-search techniques
of the previous section to find a cut-interval of a specified size on the mixed edge.
Note we only need n points to determine an (n − 1)-dimensional hyperplane. By doing a
binary search along each of the kl mixed edges, we find kl cut-intervals (rather than points).
Since kl is always at least n, these kl cut-intervals define a constrained n-dimensional region
within which the hyperplane must lie. (By the definition of a simplex, and the fact that each
of the kl cut-intervals comes from a different edge of the simplex, we are guaranteed that
the kl regions will indeed define an n-dimensional region.)
The overall strategy of the algorithm, then, is to find a cut-interval of a predetermined
size, which we call width, on each of the kl mixed edges, and then output any hypothesis
consistent with each of the cut-interval constraints.
We present the algorithm, give a derivation for the value of the key variable width,
prove the correctness of the algorithm, and finally give the sample complexity (number
of membership queries needed) and computational complexity (time to find a hypothesis
consistent with the constraints) analysis.

Figure 3.1: Simplex with pancake region within the dashed lines; w = width.
3.2.2.1 The algorithm
The algorithm begins by making n + 1 membership queries to obtain the classification of
each corner point.
For each of the resulting kl mixed edges, the algorithm does a binary search using
membership queries to find a cut-interval of size width (width is precisely defined shortly).
Since the length of each simplex edge is 1, the maximum number of queries necessary to do
each binary search is lg(1/width).
Each of the two endpoints of each cut-interval determines a constraint on where the
target hyperplane can lie, so there are 2kl constraints total. Any hypothesis consistent with
the constraints is output.
3.2.2.2 Derivation of width
In order to analyze the sample and computational complexity, we need to determine the
size of width necessary to achieve an error rate less than ε.
We form a "pancake" as follows: consider the (n − 1)-dimensional target hyperplane and
a one-dimensional line perpendicular to this hyperplane. If we move the target hyperplane
along the perpendicular a distance of width in each of the two directions away from the
target hyperplane, then we trace out a pancake-like region with total width 2 · width. (See
Figure 3.1.)
We claim that the only points on which the target hyperplane and any hypothesis
consistent with all kl constraints can disagree must lie in this n-dimensional pancake that
encloses the target hyperplane. We thus wish to choose a value of width so that the entire
pancake has total probability less than ε.
The volume of our pancake is clearly less than or equal to the width of the pancake times
the volume of the maximum cross section of the pancake. The volume of the maximum
cross section of the pancake in turn is less than or equal to the volume of the maximum
cross section of the simplex, which in turn is less than or equal to the surface area S_n of the
simplex. (We conjectured that the maximum cross section of a simplex is at most equal to
the surface area of a single face of the simplex, but could not prove this conjecture.) Thus
the volume of the pancake must be less than or equal to the width of the pancake, which is
2 · width, times the surface area of the simplex, which is S_n.
We wish that the total volume of the pancake should have probability weight less than
ε. So given smoothness constant s, it is necessary to have

s · (2 · width · S_n) < ε . (3.6)

This is equivalent to

s · 2 · width · (n + 1) · V_{n−1} < ε ,

and to

s · 2 · width · (n + 1) · √(n/2^{n−1}) / (n − 1)! < ε ,

yielding (with rearrangements of terms)

width < (ε/s) · (1/2) · √(2^{n−1}/n) · (n − 1)!/(n + 1) . (3.7)

Any value of width which satisfies (3.7) will also satisfy (3.6), and any value of width
which satisfies (3.6) will ensure that the total volume of the pancake has probability weight
less than ε. So we choose width equal to one half the right-hand side of (3.7),

width = (1/2) · (ε/s) · (1/2) · √(2^{n−1}/n) · (n − 1)!/(n + 1)
      = (ε/s) · (1/4) · √(2^{n−1}/n) · (n − 1)!/(n + 1) ,

and then analyze the resulting number of membership queries required given this choice
of width.

The number of membership queries necessary to narrow a cut-interval to size width is
lg(1/width). Here, we have

lg(1/width) = lg(s/ε) + lg 4 + lg(√(n/2^{n−1}) · (n + 1)/(n − 1)!) .
Since

lg(√(n/2^{n−1}) · (n + 1)/(n − 1)!) < 1.6

for all n (and in fact is less than 0 for n > 3), and

lg 4 = 2 ,

we have that lg(s/ε) + 4 is an upper bound on lg(1/width):

lg(1/width) < lg(s/ε) + 4 .

So the number of membership queries necessary on each mixed edge is at most lg(s/ε) + 4.
Thus, we must do a binary search making at most lg(s/ε) + 4 membership queries on each
of the kl mixed edges. Since kl ≤ n² (1 ≤ k ≤ n, 1 ≤ l ≤ n, thus kl ≤ n²), the total sample
complexity is at most n²(lg(s/ε) + 4).
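The choice of width and the resulting per-edge query count can be evaluated directly. In this sketch (ours), the values of n, s, and ε are illustrative:

```python
import math

def width(n, s, eps):
    """The chosen cut-interval size:
    (eps/s)*(1/4)*sqrt(2**(n-1)/n)*(n-1)!/(n+1), half of bound (3.7),
    so the pancake of total width 2*width has probability weight below
    eps under an s-smooth distribution."""
    return (eps / s) * 0.25 * math.sqrt(2 ** (n - 1) / n) \
        * math.factorial(n - 1) / (n + 1)

def queries_per_edge(n, s, eps):
    """Membership queries to shrink one mixed edge to a cut-interval:
    lg(1/width) = lg(s/eps) + lg 4 + lg(sqrt(n/2**(n-1))*(n+1)/(n-1)!)."""
    return math.log2(1 / width(n, s, eps))

n, s, eps = 5, 10.0, 0.01
print(queries_per_edge(n, s, eps), math.log2(s / eps) + 4)  # per edge vs bound
print(n ** 2 * (math.log2(s / eps) + 4))                    # total bound n^2(lg(s/eps)+4)
```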
3.2.2.3 Correctness of the algorithm
We claim that the maximum error rate of any hypothesis hyperplane consistent with the 2kl
constraints is less than ε. We prove this by showing that all the points on which the target
and hypothesis hyperplanes disagree, that is, the symmetric difference of the target and
hypothesis concepts, must lie in the n-dimensional pancake region of width 2 · width that
surrounds the target hyperplane. Since this pancake region has total probability weight less
than ε, the error rate of any consistent hypothesis must be less than ε.

We first show that each cut-interval endpoint lies within the pancake region, and then
show that all points in the symmetric difference of the target and hypothesis concepts lie
within the pancake.
The total length of each cut-interval is width, and the target hyperplane passes through
each cut-interval. Each cut-interval endpoint therefore lies within distance width of the
target hyperplane along its edge, so its perpendicular distance to the target hyperplane is
also at most width. Since the pancake is created by moving width along the perpendicular
to the target hyperplane in each of the two directions away from the target hyperplane,
each endpoint thus must lie within the pancake.
We now establish that any point (not just the endpoints) on which the target and
hypothesis concepts disagree lies within the specified pancake region.
Consider the convex hull created by connecting the 2kl cut-interval endpoints. (We
remark that not all of the (2kl choose 2) edges formed by connecting endpoints define the convex
hull — some of these edges lie inside the convex hull.) The target and hypothesis hyperplanes
lie entirely within this convex hull, since each point where the target and hypothesis
hyperplanes intersect an edge of the simplex is within a cut-interval. Thus, any point in the
symmetric difference of the hypothesis and target concepts also lies within the convex hull.
Now we establish that every point within the convex hull lies within the pancake region.
As already stated, each of the 2kl cut-interval endpoints has perpendicular distance to
the target hyperplane less than or equal to width. Since any point within the convex hull is
a convex combination of cut-interval endpoints, and the perpendicular distance to the target
hyperplane is a convex function, any point in the convex hull must also have distance
to the target hyperplane less than or equal to width. But then every point in the convex
hull, which in particular includes all points in the symmetric difference of the target and
hypothesis concepts, must be contained within the pancake region. Thus, since the pancake
has total probability at most ε, the hypothesis error rate must be less than ε.
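The convexity step of this argument (a convex combination of points, each within width of a hyperplane, stays within width of it) can be checked numerically. The hyperplane, the number of endpoints, and the perturbation sizes below are illustrative assumptions, not values from the construction.

```python
import random

def dist_to_hyperplane(x, w, b):
    """Perpendicular distance from x to the hyperplane w . x = b,
    assuming the normal vector w has unit length."""
    return abs(sum(wi * xi for wi, xi in zip(w, x)) - b)

# Illustrative target hyperplane x1 + x2 + x3 = 1 in R^3, rewritten
# with a unit normal: w . x = b where w = (1,1,1)/sqrt(3), b = 1/sqrt(3).
n = 3
norm = n ** 0.5
w = [1 / norm] * n
b = 1 / norm
width = 0.01

random.seed(0)
# Endpoints: points on the hyperplane, each moved at most `width`
# along the normal direction.
endpoints = []
for _ in range(6):
    base = [random.random() for _ in range(n)]
    s = sum(base)
    base = [x / s for x in base]           # lies on the hyperplane
    shift = random.uniform(-width, width)  # offset of at most width
    endpoints.append([x + shift * wi for x, wi in zip(base, w)])

# Any convex combination of the endpoints stays within `width` of the
# hyperplane, since the signed distance is a linear functional.
for _ in range(1000):
    coeffs = [random.random() for _ in endpoints]
    t = sum(coeffs)
    coeffs = [c / t for c in coeffs]
    p = [sum(c * e[i] for c, e in zip(coeffs, endpoints)) for i in range(n)]
    assert dist_to_hyperplane(p, w, b) <= width + 1e-12
```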
3.2.2.4 Complexity of the algorithm
The computational complexity is equal to the time necessary to compute a hypothesis
consistent with the 2kl constraints. We can solve this constraint problem using linear
programming in time polynomial in the number of constraints, 2kl, and the precision of
each constraint, lg(1/width). The worst-case bound for solving the constraint problem
with 2kl constraints and bit precision lg(1/width) requires time O((2kl)^3.5 · lg(1/width)).
Since 2kl is O(n²), and lg(1/width) is O(lg(s/ε)), we can find a hypothesis consistent with all
2kl constraints in time O(n^7 lg(s/ε)).
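The consistency problem itself can be illustrated on a small, hypothetical instance. The sketch below uses a perceptron-style update rather than the polynomial-time linear programming solver the bound assumes, and the labeled endpoints are invented for the example; it shows only what "a hyperplane consistent with the 2kl constraints" means.

```python
def find_consistent_hyperplane(points, labels, max_epochs=10000):
    """Perceptron search: return (w, b) with sign(w . x - b) matching
    the +1/-1 labels, or None if no consistent hyperplane is found
    within max_epochs passes over the data."""
    n = len(points[0])
    w, b = [0.0] * n, 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            if y * (sum(wi * xi for wi, xi in zip(w, x)) - b) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b -= y
                mistakes += 1
        if mistakes == 0:  # every constraint satisfied
            return w, b
    return None

# Hypothetical endpoint constraints in the plane: points just inside
# the target half-space x + y <= 1 are labeled +1, points just
# outside are labeled -1.
pos = [(0.10, 0.85), (0.85, 0.10), (0.45, 0.50)]
neg = [(0.15, 0.90), (0.90, 0.15), (0.50, 0.55)]
points = pos + neg
labels = [1] * len(pos) + [-1] * len(neg)

w, b = find_consistent_hyperplane(points, labels)
# The returned hyperplane classifies every constraint point correctly.
for x, y in zip(points, labels):
    assert y * (sum(wi * xi for wi, xi in zip(w, x)) - b) > 0
```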
Chapter 4
Conclusions and Open Problems
We have shown that membership queries do not help us to pac-learn a wide variety of
classes. We have also shown that a priori knowledge of a smooth probability distribution
does not help us to pac-learn many concept classes, when we are only allowed to draw
random examples. A smooth probability distribution, however, does help us to pac-learn
certain of these classes when we can also make membership queries. In particular, we have
shown a significantly reduced upper bound on the number of samples needed to learn I,
which is a concept class in one dimension, and S, which is a concept class in n dimensions.
We hope to generalize this upper bound to handle many other classes.
The following open problems arise from this research:
1. Improve the lower bound of Theorem 1 to depend upon VCdim(C).
2. Generalize the n-dimensional approach of Theorem 3 to handle (n−1)-dimensional
hyperplanes intersecting arbitrary bounded convex n-dimensional domains (e.g. hyperspheres
or hypercubes).
3. Regarding problem 2 above, we hypothesize that it is possible to make a reduction
from our problem to the problem solved by Maass and Turan [15] of exactly learning
a target hyperplane in a boolean-valued n-dimensional sample space. In making the
reduction we would capitalize on the fact that we are working with a smooth probability
distribution, and so simulate a boolean-valued domain in our continuous space.
We would choose a small enough value for the distance between the boolean-valued
points that, if an algorithm exactly learned the target hyperplane, the error rate
in the continuous space would be less than ε. This reduction is not worked out, but
if correct, it would give good bounds on learning half-spaces in n-dimensional space.
Chapter 5
Acknowledgments
I first of all thank Ron Rivest. He has been a superb and supportive advisor.
I thank Avrim Blum, Mic Grigni, Alex Ishii, Joe Kilian, James Park, Rob Schapire, and
Robert Solovay for their helpful discussions and comments.
I thank Jonathan Amsterdam, Aditi Dhagat, Sally Goldman, Michael Kearns, Norman
Margolus, and Rob Schapire for proof-reading my paper and giving me advice on my COLT
talk.
Thanks especially to Tandy Warnow for always being such a nurturing and supportive
friend-teacher, and to Phil Rogaway for always giving me such good advice.
Special thanks to Be Hubbard both for all her help to me personally and for helping to
create and maintain the warm, supportive, thriving theory community.
I also thank all my other wonderful friends not already mentioned above, including:
Wendy Brunzie, Carolyn Cox, Ann Feehan, Shalom Franck, Debra Goldentyer, Deborah
Greif, Agiua Heath, Maya Kopell, Suzanne Lawrence, Mojdeh Mohteshemi, Sharon Port,
Karen Saracik, Mark Schaeffer, Michal Shimshoni, Mona Singh, Julie Sweedler, Margaret
and Mark Tuttle, Debbie Utley, Debra Waller, Jill Werman, Su-Ming Wu, and Guillermo
Zemba.
Most of all, I thank my wonderful parents for all their love and support.
Bibliography
[1] Dana Angluin. Types of queries for concept learning. Technical Report
YALEU/DCS/TR-479, Yale University Department of Computer Science, June 1986.
[2] Dana Angluin. Learning regular sets from queries and counterexamples. Information
and Computation, 75:87-106, November 1987.
[3] Dana Angluin. Queries and concept learning. Machine Learning, 2(4):319-342, April
1988.
[4] Dana Angluin, Lisa Hellerstein, and Marek Karpinski. Learning read-once formulas
with queries. Technical report, University of California, Report No. UCB/CSD 89/528,
1989.
[5] Eric B. Baum. On learning a union of half spaces. Journal of Complexity, 6(1):67-101,
March 1990.
[6] Shai Ben-David and Gyora M. Benedek. Measurability constraints on pac learnability.
Technical report, Technion, Haifa, 1991.
[7] Shai Ben-David, Gyora M. Benedek, and Yishay Mansour. A parametrization scheme
for classifying models of learnability. In Proceedings of COLT '89, pages 285-302.
Morgan Kaufmann, 1989.
[8] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth.
Classifying learnable geometric concepts with the Vapnik-Chervonenkis dimension. In
Proceedings of the Eighteenth Annual ACM Symposium on Theory of Computing, pages
273-282, Berkeley, California, May 1986.
[9] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth.
Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4):929-
965, 1989.
[10] Thomas M. Cover. Geometrical and statistical properties of systems of linear in-
equalities with applications to pattern recognition. IEEE Transactions on Electronic
Computers, EC-14(3):326-334, 1965.
[11] H. S. M. Coxeter. Regular Polytopes. Dover Publications, Inc., New York, second
edition, 1973.
[12] Andrzej Ehrenfeucht, David Haussler, Michael Kearns, and Leslie Valiant. A gen-
eral lower bound on the number of examples needed for learning. Information and
Computation, 82(3):247-251, September 1989.
[13] David Haussler. Probably approximately correct learning. Technical Report UCSC-
CRL-90-16, University of California Santa Cruz, May 1990.
[14] Michael Kearns and Leslie G. Valiant. Cryptographic limitations on learning boolean
formulae and finite automata. In Proceedings of the Twenty-First Annual ACM Sym-
posium on Theory of Computing, pages 433-444, Seattle, Washington, May 1989.
[15] Wolfgang Maass and György Turán. How fast can a threshold gate learn? In Drastal,
Hanson, and Rivest, editors, Computational Learning Theory and Natural Learning
Systems: Constraints and Prospects. MIT Press. (To appear.).
[16] Leonard Pitt. Recent results in computational learning theory, 1990.
[17] Leonard Pitt and Leslie G. Valiant. Computational limitations on learning from ex-
amples. Technical report, Harvard University Aiken Computation Laboratory, July
1986.
[18] Ronald L. Rivest. Learning decision lists. Machine Learning, 2(3):229-246, 1987.
[19] Leslie G. Valiant. A theory of the learnable. Communications of the ACM,
27(11):1134-1142, November 1984.
[20] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag,
New York, 1982.